17 Genomics
assigned to the genes or groups of genes whose transcription they control. Other
tasks include the identification of those genes (in humans, mammals, etc.) believed
to originate from viruses and the localization of hypervariable regions (e.g., those
coding for immunoglobulins). Ultimately, the aim is to be able to understand the
relationships among the various elements of the genome.
Gene prediction can be divided into intrinsic (template) and extrinsic (lookup)
methods. The former are the best candidates for leading to fundamental insight into
how genes work; if successful, they should also provide the means to generalize
from the biochemistry of natural sequences to rules for designing new genes (and
genomes) to fulfil specified functions. We shall begin, however, by considering the
conceptually simpler extrinsic methods.
17.4 Extrinsic Methods
The principle of the extrinsic or lookup method is to identify a gene by finding a
sufficiently similar known object in existing databases. Hence, the method is based
on sequence similarity (to be discussed in Sect. 17.4.2), using the still relatively
small core of genes identified by classical genetic and molecular biological studies
to prime the comparison; that is, a gene of unknown function is compared with a
database of sequences of known function. This approach reflects a widely used,
but not necessarily correct (or genuinely useful), assumption that similar sequences
have similar functions.¹⁵ A major limitation of this approach is that, at present,
about a third of the sequences of newly sequenced organisms match no sufficiently
similar known sequence in existing databases. Furthermore, errors in the sequences
deposited in databases can pose a serious problem.
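
The lookup principle described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-entry database and its annotations are invented, the threshold of 0.8 is arbitrary, and the similarity score (the matching ratio from Python's standard `difflib` module) is only a crude stand-in for the alignment-based measures of sequence similarity that are normally used.

```python
from difflib import SequenceMatcher

# Hypothetical toy "database" of annotated sequences (invented for illustration).
KNOWN_GENES = {
    "geneA (hypothetical kinase)": "ATGGCGTACGTTAGCCTGAAAGGCTGA",
    "geneB (hypothetical transporter)": "ATGCCCTTTGGGAAACCCTTTGGGTAA",
}


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def lookup(query: str, database=KNOWN_GENES, threshold: float = 0.8):
    """Return (name, score) of the most similar known sequence,
    or None if no entry reaches the (arbitrary) threshold."""
    best_name, best_score = None, 0.0
    for name, seq in database.items():
        score = similarity(query, seq)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None


if __name__ == "__main__":
    # A query differing from geneA by a single base still finds geneA.
    print(lookup("ATGGCGTACGCTAGCCTGAAAGGCTGA"))
    # A query with no sufficiently similar counterpart yields None.
    print(lookup("C" * 27))
```

The second query corresponds to the limitation noted above: when nothing in the database is sufficiently similar, the method returns no annotation at all. Likewise, a match only transfers the annotation of the stored sequence, so any error in the database entry is inherited by the query.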
17.4.1 Database Reliability
An inference, especially a deductive one, is only as good as the data from which it
is drawn. The reliability of the data is therefore a matter of legitimate concern.
The most pernicious errors are wrong nucleic acid bases in
a sequence. The sources of such errors are legion and range from experimental
uncertainties to mistakes in typing the letters into a file using a keyboard. Of course,
these errors can be considered as a source of noise (i.e., equivocation) and handled
with the ideas developed earlier, especially in Chap. 7. Undoubtedly, there is a certain
redundancy in the sequences, but these questions of equivocation and redundancy in
¹⁵ Note that “homology” is defined as “similarity in structure of an organ or molecule, reflecting
a common evolutionary origin”. Sequence similarity is insufficient to establish homology, since
genomes contain both orthologous (related via common descent) and paralogous (resulting from
duplications within the genome) genes.